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Abstract 

It is well known that the block Krylov subspace solvers work efficiently for 
some cases of the solution of differential equations with multiple right-hand 
sides. In lattice QCD calculation of physical quantities on a given configura- 
tion demands us to solve the Dirac equation with multiple sources. We show 
that a new block Krylov subspace algorithm recently proposed by the au- 
thors reduces the computational cost significantly without loosing numerical 
accuracy for the solution of the 0(a)-improved Wilson-Dirac equation. 
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1. Introduction 

In the last decade one of the primary issues in lattice QCD is to reduce 
the dynamical up and down quark masses toward the physical values. Most 
of our efforts have been devoted to reduce the computational cost for the 
configuration generation with light quark masses. Thanks to the algorithmic 
improvement together with rapid increase of the computational power we are 
now allowed to make a direct full QCD simulation on the physical up and 
down quark masses[l|, 0]. On the other hand, we have been less concerned 
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with the algorithmic improvement for the calculation of physical quantities: 
its computational cost was negligibly smaller than that for the configura- 
tion generation until recently. At present, however, the latter is drastically 
reduced to be comparable to or smaller than the former. 

The characteristic feature in the calculation of physical quantities is the 
solution of the Dirac equation with multiple sources: twelve in the simplest 
case and O(10 — 100) for the stochastic technique. These are typical exam- 
ples of the solution of differential equations with multiple right-hand sides. 
For this type of equations it is known that the block Krylov subspace solvers 
succeed in reducing the computational costj3]. In this article we study the 
application of the block BiCGSTAB algorithm^ and a new block Krylov 
subspace method proposed in Ref. [6] to the 0(a)-improved Wilson-Dirac 
equation which is one of the popular fermion formulations in current lat- 
tice QCD simulations. For simplicity our numerical test is restricted on the 
quenched configuration with a fixed volume. We investigate the quark mass 
dependence of the algorithmic efficiency in detail. 

This paper is organized as follows. In Sec. [2] we give the definition of the 
0(a)-improved Wilson-Dirac equation. The algorithmic details are described 
in Sec. |3j In Sec. @]we present the results of the numerical tests after explain- 
ing the parameter choice and the machine specifications. Our conclusions are 
summarized in Sec. EJ 

2. Wilson-Dirac equation 

The lattice QCD is defined on a hypercubic four- dimensional lattice of 
finite extent being expressed as L x x L y x L z x L t with L Xj y jZ the three- 
dimensional spatial extent and L t the temporal one. The lattice spacing is 
set to unity for notational convenience. The fields are defined on the sites n 
with periodic boundary conditions. 

We define two types of fields on the lattice. One is the gauge field rep- 
resented by U^(n,a,b) with /i = 1,2,3,4 and a, b = 1,2,3 which is a 3 x 3 
SU(3) matrix assigned on each link. The other is the quark field q(n, a, a) 
which resides on each site carrying the Dirac index a = 1, 2, 3, 4 and the color 
index a = 1,2,3. The 0(a)-improved Wilson-Dirac operator is written as 

D w (n,a,a;m,/3,b) = 5„, m <!> a ,/A,& 




M =l,...,4 
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+{<W + Jp(ai, P)}Ul(n - p,, a, &)5 TO ,„_ A ] 
-k-csw |<v( a >^)^( n > a, 6)5 m , n , (1) 

jU,f=l,...,4 

where fi denotes the unit vector in the \i direction. The coefficient csw is a 
parameter to be adjusted for the 0(a) improvement. The Euclidean gamma 
matrices are defined in terms of the Minkowski ones in the Bjorken-Drell 
convention: 7,- = -ij BD (j = 1,2,3), 74 = i° BEn 75 = 1 BD and o> = 
^[7^,7^]. The explicit representations for 71,2,3,4,5 are given by 
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where we list only nonzero elements. The field strength F^ v in the last term 
of Eq. (TTJ is expressed as 



1 4 1 

6 ) = ^2i{ Pi{n ' a ' b) ~ P ' {n ' a ' b) J 
i=i 



(7) 
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with 

Pi(n,a,b) = ) y U^jn, a, c)U v (n + /2, c, d)Ul{n + z>, d, e)Ul(n, e, b), (8) 

c,d,e 

P2(n,a,b) = U v (n, a, c)U^(n — ft + £>,c,d) 

c,d,e 

xU}(n- #,d,e)f7 M (n- £,e,&), (9) 
P 3 (n,a,b) = yjZ7^(n — fl,a,c)Ul(n — /t — v,c,d) 

c,d,e 

xU^(n — fi — u,d,e)U v (n — u,e,b), (10) 
P 4 (n, a,6) = S ^^Ul{n — v,a,c)U v {n — v,c,d) 

c,d,e 

xU v (n + jl-C>,d,e)Ul(n,e,b). (11) 

The 0(a)-improved Wilson-Dirac operator defined by Eq. (jTJ) is a complex 
non-Hermitian square matrix, where only 51 out of L x x x L z x x 3 x 4 
entries in each row have nonzero values. The matrix becomes fairly sparse 
in current numerical simulations with L XjVtZtt ~ O(10). 

The calculation of physical quantities requires the solution of the following 
linear equations: 

Dy/(n, a, a; m, (3, b)x^\m, f3, b) = s^\n, a, a), (12) 

m,/3,b 

where represents the z-th source vector. To illustrate the situation we 
consider the calculation of the hadron two-point function with the point 
source at the origin as a simplest example. In this case we need twelve 
source vectors expressed as 

with a = 1,2,3 and a = 1,2,3,4. Another good example is the stochastic 
technique which usually requires O(10 — 100) noise sources. Although we 
can think of a lot of other interesting examples, our numerical tests, which 
is explained later in Sec. HI concentrate on the simplest one. 
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3. Block Krylov subspace methods 



Before explaining block Krylov subspace algorithms it should be better to 
reformulate the problem in a generalized form. This help the readers easily 
understand the essence avoiding any complex notations specific to lattice 
QCD. 

Our interest exists in the solution of linear systems with the multiple 
right-hand sides expressed as 

AX = B, (14) 

where A is an N x N complex sparse non-Hermitian matrix. X and B are 
N x L complex rectangular matrices given by 

X=(xW (15) 
B= (6« ...,b« (16) 

In the case of the Wilson-Dirac equation N = L x x L y x L z x L t x 3 x 4 and 
L is the number of the source vectors. 

For a preparative purpose we first write down the well-known BiCGSTAB 
algorithm for solving a single right-hand side linear system, where x = x^ 
and b = bW in Eq. (fT4l): 

x is an initial guess, 
Compute r = b — Axo, 
Set p = r , 

Choose f such that (f , r ) ^ 0, 

For k — 0, 1, . . . , until || T*>t || 2/ 1| ^ || 2 < e do: 

«fc = (ro,r fc )/(fo, Ap k ), 
tk = r k - a k Ap k) 
Ck = (Atk,t k )/{At k ,Atk), 
x k +i = x k + a k Pk + Cktk, 

fk+l = tk — CkAtk, 

Pk = (oik/Ck) ■ (fo,r k+1 )/(r Q , r k ), 
Pk+i = r k+1 + (3 k {p k - ( k Ap k ), 
End for. 

It is rather straightforward to extend the algorithm to the blocked version 
for solving multiple right-hand sides linear system: 
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X G C NxL is an initial guess, 
Compute Rq = B — AX , 
Set P = Ro, 
Choose R e C^, 

For fc = 0, 1, until maxiQIrf^/Hb^lh) < e do: 
V k = AP k , 
Solve (RfV k )a k = R^R k for a k , 
T k = R k — V k a k , 

(I = Tr (Z«T k ) /Tr (ZfZ k ), 
Xk+i = X k + P k a k + ( k T k , 
Rk+l = T k — ( k Z k ,^ 
Solve (R*V k )f3 k = -R*Z k for 

End for, 

where R k ,P k ,T k are N x L complex rectangular matrices and a kl (3 k are 
L x L complex square ones. At the k-th iteration in the block BiCGSTAB 
algorithm we find 

X k G X + K.f(A;Ro), (17) 
R k G Kf +1 (A;Ro), (18) 

where /Cf (A; _R ) is a block Krylov subspace defined by 

{fc-i 
3=0 

This yields an essential difference from the consecutive application of the 
BiCGSTAB algorithm to Ax® = fcW for i = 1, . . . , L: The blocked version 
searches the solution vectors with the enlarged Krylov subspace. 

In Ref. j^] a new block Krylov subspace method is proposed. This method 
improves the numerical accuracy of the block BiCGSTAB method where 
multiplication of the matrix a k yields contaminations of the rounding error 
on the solution^. The algorithm is as follows: 

X G C nxL is an initial guess, 
Compute R = B — AX , 
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Set P = R and V = W = AR , 
Choose R eC nxL , 

For k = 0, 1, until max^Hrflla/H&^lh) < e do: 
Solve (^Vfc)a fc = -Rq i? fc for a fc , 



a 


= Tr^R,} /Tr [W*W k ], 


Sk 


= Pk — CkVk, 


u k 


= S k a k , 


Y k 


= AU k , 


Xk+i 


— X k + ( k R k + U k , 


Rk+l 


— Rk — (kWk — Y k , 


w k+1 


= AR k+ i, 



Solve (RoRkhk = RfR k+1 /( k for 7*, 



Pk+i — Rk+i + u^k, 

V fc+ i = W k+1 + Y k j k , 
End 



This algorithm is constructed to avoid the rounding error problem ob- 
served in the block BiCGSTAB method. 



4. Numerical tests 

4-1. Choice of parameters 

The L dependence of the efficiency of the block Krylov algorithms, in 
which we are most interested, should be investigated on single CPU avoid- 
ing contaminations due to the communication overhead. The memory re- 
quirement forces the lattice size to be moderate. Our numerical tests are 
performed with samples of 10 statistically independent gauge field configura- 
tions on a 16 3 x 32 lattice generated by the Iwasaki gauge action at (5 — 2.575 
in quenched approximation, which was employed in Ref. We solve the 
Wilson-Dirac equation with the local source at the origin choosing four hop- 
ping parameters k = 0.1359, 0.1357, 0.1355, 0.1300 with the improvement co- 
efficient csw — 1-345. The bare quark mass is given by m q = (1/ac — l//c c )/2 
with k c = 0.136116(8), which increases as k decreases. In Table [T] we list 
the pion mass in physical unit and the m^/nip ratio at each hopping param- 
eter following Ref. [3|. Although the pion mass at k — 0.1300 is extremely 
large from a view point of physical interest, we employ it for a comparative 
purpose. The lattice spacing estimated by m p is a = 0.1130 fm[i|. 
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Table 1: Pion mass and jm p ratio at each hopping parameter. 



K 


k c = 0.136116(8) 


i 0.1359 


0.1357 


0.1355 


0.1300 


m w [MeV] 





221 


307 


375 


1282 


m n /m p 





0.28 


0.39 


0.46 


0.87 



4-2. Test environment 

Numerical tests are carried out on single node of T2K-Tsukuba which is a 
large-scale cluster system 648 compute nodes providing 95.4Tflops of comput- 
ing capability. Each node consists of quad-socket, 2.3GHz Quad-Core AMD 
Opteron Model 8356 processors whose on-chip cache sizes are 64KBytes/core, 
512KBytes/core, 2MB/chip for LI, L2, L3, respectively. Each processor has 
a direct connect memory interface to an 8GBytes DDR2-667 memory and 
three hypertransport links to connect other processors. All the nodes in 
the system are connected through a full-bisectional fat-tree network consist- 
ing of four interconnection links of 8GBytes/sec aggregate bandwidth with 
Infiniband. 

4-3. Results 

Table [2] shows the L dependence of the computational cost to solve the 
Wilson-Dirac equation with the block BiCGSTAB algorithm imposing rather 
stringent tolerance e = 10~ 14 for the relative residual. We present averaged 
values over 10 configuration samples for the number of iteration and the exe- 
cution time. The true residual, which is defined by max, fr^l^/H^I^, 
is evaluated after the relative residual reaches the tolerance. The maximum 
and minimum values among 10 configuration samples are listed. Although 
the computational cost divided by L is considerably reduced as L increases 
at k — 0.1359,0.1357,0.1355, there exist two concerns: One is the discrep- 
ancy between the relative residual and the true one which is enhanced as L 
increases. The other is the convergence failure of the relative residual, which 
is found for 4 samples out of 10 at k — 0.1359 with L = 4. This flaw is also 
observed at k = 0.1355 and 0.1357 once we go beyond L = 4. 

In order to remove the discrepancy of the relative residual and true one, 
we proposed a new algorithm whose details are presented in Ref. jfjj]. The 
effectiveness of the new method is observed in Table [3j where the deviation 
between the true and the relative residuals essentially vanishes. Figure [1] 
plots the number of iteration and the execution time divided L as a function 
of L. We observe two important features. One is the acceleration of the 
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cost reduction at the light quark masses where the physical interest exists: 
At k — 0.1357 the number of iteration is almost cut in half from L = 1 
to 4, which should be compared to the case of k = 0.1300. The other is 
that the executional time divided by L decreases faster than the number 
of iteration as L increases: The former is reduced by more than 60% from 
L = 1 to 4 at k = 0.1357. This fact demonstrates the efficiency of the cache- 
aware implementation for the matrix multiplication on the multiple vectors. 
In Fig. [2] we show a representative behavior of the relative residual as a 
function of the number of iteration at each hopping parameter. We observe 
little L dependence up to 700 iterations, beyond which the convergence speed 
is accelerated as L increases at k = 0.1359,0.1357,0.1355. 



Table 2: L dependence of the number of iteration and the execution time to solve the 
Wilson-Dirac equations with the block BiCGSTAB method. True residual is evaluated 
after the relative residual reaches the tolerance e — 10~ 14 . 





L 


^iteration 


time 


time/L 


true residual 










[sec] 


[sec] 


max 




min 




0.1359 


1 


2432.6 


1439.3 


1439.3 


1.57 x 10' 


-14 


7.99 x 10- 


-15 




2 
4 


1696.8 


1316.0 


658.0 


2.91 x 10" 


-11 


2.36 x 10- 


-13 


0.1357 


1 


1999.0 


1162.4 


1162.4 


1.08 x 10- 


-14 


5.98 x 10- 


-15 




2 


1410.1 


1092.4 


546.2 


2.62 x 10- 


-12 


3.75 x 10- 


-14 




4 


1100.0 


1633.3 


408.3 


1.36 x 10- 


-11 


6.33 x 10- 


-1.3 


0.1355 


1 


1518.5 


884.4 


884.4 


1.09 x 10- 


-14 


5.25 x 10- 


-15 




2 


1264.2 


979.6 


489.8 


2.55 x 10- 


-12 


1.49 x 10- 


-14 




4 


961.7 


1430.0 


357.5 


1.30 x 10- 


-11 


2.30 x 10- 


-13 


0.1300 


1 


165.3 


96.4 


96.4 


9.29 x 10- 


-15 


3.68 x 10- 


-15 




2 


172.8 


134.3 


67.2 


1.57 x 10- 


-11 


6.14 x 10- 


-15 




4 


181.0 


272.7 


68.2 


3.42 x 10- 


-13 


7.99 x 10- 


-15 



5. Conclusions 

In this paper we present the first example for the efficiency of the block 
Krylov subspace methods in lattice QCD to solve the 0(a)-improved Wilson- 
Dirac equation with multiple local sources. We find remarkable cost reduc- 
tions for the light quark masses. Roughly speaking, the solver performance 
normalized by L is doubled by increasing L from one to four at the light 
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Table 3: L dependence of the number of iteration and the execution time to solve the 
Wilson-Dirac equations with the new method. True residual is evaluated after the relative 
residual reaches the tolerance e = 10~ 14 . 



K 


L 


^iteration 


time 


time/L 


true residual 










[sec] 


[sec] 


max 




min 




0.1359 


1 


2440.3 


1398.2 


1398.2 


1.53 x 10' 


-14 


8.57 x 10" 


-15 




2 
4 


1701.5 


1308.4 


654.2 


2.92 x 10" 


-13 


6.65 x 10- 


-15 


0.1357 


1 


1986.6 


1138.1 


1138.1 


1.14 x 10" 


-14 


6.81 x 10- 


-15 




2 


1417.0 


1090.1 


545.1 


1.08 x 10- 


-14 


8.51 x 10- 


-15 




4 


1063.6 


1556.6 


389.2 


1.51 x 10- 


-14 


6.72 x 10- 


-15 


0.1355 


1 


1519.0 


870.0 


870.0 


1.18 x 10" 


-14 


9.03 x 10- 


-15 




2 


1252.9 


962.9 


481.5 


1.08 x 10- 


-14 


7.68 x 10- 


-15 




4 


975.7 


1427.1 


356.8 


1.29 x 10" 


-14 


7.23 x 10- 


-15 


0.1300 


1 


165.4 


95.3 


95.3 


9.34 x 10- 


-15 


4.02 x 10- 


-15 




2 


173.1 


133.6 


66.8 


9.79 x 10- 


-15 


5.15 x 10- 


-15 




4 


181.7 


266.1 


66.5 


9.10 x 10- 


-15 


6.79 x 10- 


-15 



quark masses. The block Krylov subspace methods have two advantageous 
points. Firstly we can easily implement the method by extending the conven- 
tional solver for a single right hand side. The deviation between the relative 
residual and the true one observed in the block BiCGSTAB algorithm is suc- 
cessfully removed by the new algorithm proposed in Ref. |6j. Although our 
numerical tests are carried out on a single CPU, it is obvious that there is no 
difficulty in parallelization. Secondly the multiplication of the Wilson-Dirac 
matrix on the multiple vectors allow us an effective use of the cache, where 
the link variables can be retained during the operation. The cost reduction is 
achieved by not only the algorithmic efficiency but also the implementation 
technique. One concern about the methods is that the increase of L makes 
difficult the convergence of the relative residuals at light quark masses. We 
are now investigating its origin and possible improvements. 
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Figure 2: Representative behaviors of the relative residual as a function of the number of 
iteration. All the measurements are performed on the same configuration. 
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